Advantage Updating Applied to a Differential Game

Authors

  • Mance E. Harmon
  • Leemon C. Baird
  • A. Harry Klopf
Abstract

An application of reinforcement learning to a linear-quadratic, differential game is presented. The reinforcement learning system uses a recently developed algorithm, the residual gradient form of advantage updating. The game is a Markov Decision Process (MDP) with continuous time, states, and actions, linear dynamics, and a quadratic cost function. The game consists of two players, a missile and a plane; the missile pursues the plane and the plane evades the missile. The reinforcement learning algorithm for optimal control is modified for differential games in order to find the minimax point, rather than the maximum. Simulation results are compared to the optimal solution, demonstrating that the simulated reinforcement learning system converges to the optimal answer. The performance of both the residual gradient and non-residual gradient forms of advantage updating and of Q-learning is compared. The results show that advantage updating converges faster than Q-learning in all simulations. The results also show that advantage updating converges regardless of the time step duration; Q-learning is unable to converge as the time step duration grows small.

Presented at the Neural Information Processing Systems Conference, Denver, Colorado, November 28 - December 3, 1994.

* U.S.A.F. Academy, 2354 Fairchild Dr. Suite 6K41, USAFA, CO 80840-6234

1 ADVANTAGE UPDATING

The advantage updating algorithm (Baird, 1993) is a reinforcement learning algorithm in which two types of information are stored. For each state x, the value V(x) is stored, representing an estimate of the total discounted return expected when starting in state x and performing optimal actions. For each state x and action u, the advantage A(x,u) is stored, representing an estimate of the degree to which the expected total discounted reinforcement is increased by performing action u rather than the action currently considered best. The optimal value function V*(x) represents the true value of each state. The optimal advantage function A*(x,u) will be zero if u is the optimal action (because u confers no advantage relative to itself), and A*(x,u) will be negative for any suboptimal u (because a suboptimal action has a negative advantage relative to the best action). The optimal advantage function A* can be defined in terms of the optimal value function V*:

    A^*(x,u) = \frac{1}{\Delta t} \left[ R(x,u) + \gamma^{\Delta t} V^*(x') - V^*(x) \right]    (1)

The definition of an advantage includes a 1/\Delta t term to ensure that, for small time step duration \Delta t, the advantages will not all go to zero. Both the value function and the advantage function are needed during learning, but after convergence to optimality, the policy can be extracted from the advantage function alone. The optimal policy for state x is any u that maximizes A*(x,u). The notation

    A_{\max}(x) = \max_u A(x,u)    (2)

defines Amax(x). If Amax converges to zero in every state, the advantage function is said to be normalized. Advantage updating has been shown to learn faster than Q-learning (Watkins, 1989), especially for continuous-time problems (Baird, 1993). If advantage updating (Baird, 1993) is used to control a deterministic system, there are two equations that are the equivalent of the Bellman equation in value iteration (Bertsekas, 1987).
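To make equations (1) and (2) concrete, here is a minimal sketch, not taken from the paper, of how a tabular learner might store V and A, form an advantage target from one deterministic transition as in equation (1), and extract a greedy policy from the advantages. The state/action discretization, constants, and variable names are assumptions for illustration only.

import numpy as np

# Assumed toy discretization (not from the paper): a small number of
# discrete states and actions, time step dt, and discount gamma per unit
# time, so that the per-step discount is gamma**dt.
n_states, n_actions = 10, 3
dt, gamma = 0.1, 0.9

V = np.zeros(n_states)               # value estimates V(x)
A = np.zeros((n_states, n_actions))  # advantage estimates A(x, u)

def advantage_target(x, u, r, x_next):
    # Equation (1) with the current estimate V in place of V*:
    # A(x,u) = (1/dt) * [R(x,u) + gamma**dt * V(x') - V(x)].
    # The 1/dt factor keeps advantages from vanishing as dt shrinks.
    return (r + gamma**dt * V[x_next] - V[x]) / dt

def a_max(x):
    # Equation (2): Amax(x) = max_u A(x, u).
    return A[x].max()

def greedy_action(x):
    # After convergence, the policy is any u that maximizes A(x, u).
    return int(A[x].argmax())

def is_normalized(tol=1e-3):
    # The advantage function is normalized when Amax(x) is (near) zero
    # in every state.
    return all(abs(a_max(x)) < tol for x in range(n_states))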
These are a pair of simultaneous equations (Baird, 1993):

    A(x,u) - \max_{u'} A(x,u') = \frac{1}{\Delta t} \left[ R + \gamma^{\Delta t} V(x') - V(x) \right]    (3)

    \max_u A(x,u) = 0    (4)

where a time step is of duration \Delta t, and performing action u in state x results in a reinforcement of R and a transition to state x_{t+\Delta t}. The optimal advantage and value functions will satisfy these equations. For a given A and V function, the Bellman residual errors E, as used in Williams and Baird (1993) and defined here as equations (5) and (6), are the degrees to which the two equations are not satisfied:

    E_1(x_t,u_t) = \frac{1}{\Delta t} \left[ R(x_t,u_t) + \gamma^{\Delta t} V(x_{t+\Delta t}) - V(x_t) \right] - A(x_t,u_t) + \max_{u'} A(x_t,u')    (5)

    E_2(x_t,u_t) = - \max_{u'} A(x_t,u')    (6)

2 RESIDUAL GRADIENT ALGORITHMS

Dynamic programming algorithms can be guaranteed to converge to optimality when used with look-up tables, yet be completely unstable when combined with function-approximation systems (Baird & Harmon, in preparation). It is possible to derive an algorithm that has guaranteed convergence for a quadratic function-approximation system (Bradtke, 1993), but that algorithm is specific to quadratic systems. One solution to this problem is to derive a learning algorithm that performs gradient descent on the mean squared Bellman residuals given in (5) and (6). This is called the residual gradient form of an algorithm. There are two Bellman residuals, (5) and (6), so the residual gradient algorithm must perform gradient descent on the sum of the two squared Bellman residuals. It has been found useful to combine reinforcement learning algorithms with function-approximation systems (Tesauro, 1990, 1992). If function-approximation systems are used for the advantage and value functions, and if the function-approximation systems are parameterized by a set of adjustable weights, and if the system being controlled is deterministic, then, for incremental learning, a given weight W in the function-approximation system could be changed on each time step according to equation (7).
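As an illustration of the residual gradient idea, the following is a minimal sketch, not taken from the paper, of how the two Bellman residuals of equations (5) and (6) might be computed for a single deterministic transition and how the weights of simple linear approximators for V and A could then be adjusted by gradient descent on the sum of the two squared residuals. Equation (7) itself is not reproduced in this excerpt, so the update below only follows the general prescription stated above; the feature vectors, learning rate, and the subgradient taken through the maximizing action are all assumptions made for the illustration.

import numpy as np

# Assumed setup (not from the paper): linear approximators
#   V(x)   = w_v . phi_v(x)
#   A(x,u) = w_a . phi_a(x,u)
# with hand-supplied feature vectors, discount gamma per unit time,
# time step dt, and learning rate alpha.
dt, gamma, alpha = 0.1, 0.9, 0.01
w_v = np.zeros(4)   # value-function weights
w_a = np.zeros(6)   # advantage-function weights

def V(phi_vx):
    return w_v @ phi_vx

def A(phi_axu):
    return w_a @ phi_axu

def residual_gradient_step(phi_vx, phi_vx_next, phi_axu, phi_ax_all, r):
    # One incremental update from a deterministic transition (x, u, r, x').
    # phi_ax_all lists phi_a(x, u') for every candidate action u', so that
    # max_u' A(x, u') can be evaluated.
    global w_v, w_a
    a_values = np.array([A(p) for p in phi_ax_all])
    best = int(a_values.argmax())
    a_max = a_values[best]

    # Equation (5): E1 = (1/dt)[R + gamma**dt V(x') - V(x)] - A(x,u) + max_u' A(x,u')
    e1 = (r + gamma**dt * V(phi_vx_next) - V(phi_vx)) / dt - A(phi_axu) + a_max
    # Equation (6): E2 = -max_u' A(x,u')
    e2 = -a_max

    # Gradients of E1 and E2 with respect to the weights. For linear
    # approximators these are just feature vectors; the max is handled by
    # differentiating through the currently maximizing action (a common
    # simplification, assumed here).
    de1_dwv = (gamma**dt * phi_vx_next - phi_vx) / dt
    de1_dwa = phi_ax_all[best] - phi_axu
    de2_dwa = -phi_ax_all[best]

    # Gradient descent on the sum of the two squared Bellman residuals.
    w_v = w_v - alpha * e1 * de1_dwv
    w_a = w_a - alpha * (e1 * de1_dwa + e2 * de2_dwa)
    return e1, e2

A full learner would apply this step to every observed transition; the particular feature vectors and learning rate here are placeholders, and note that the residual gradient includes the term through V(x'), which is what distinguishes it from a direct (non-residual) update.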
